MULTISCALE 

MODELING 


Improving NASA’s Multiscale 
Modeling Framework for Tropical 
Cyclone Climate Study 

One of the current challenges in tropical cyclone research is how to improve our 
understanding of TCs' interannual variability as well as climate change's impact. Modern 
advances in global modeling , visualization , and supercomputing technologies at NASA 
show potential, but scalability is an issue. Recent improvements to the multiscale modeling 
framework make long-term TC-resolving simulations much more feasible. 


S tudies of tropical cyclone interannual 
variability and the impact of climate 
change on TCs have received increas- 
ing attention, particularly as during the 
past 10 years statistics indicate that TCs are the 
deadliest weather events in the US (www.nws. 
noaa.gov/os/hazstats.shtml). Improving short- 
and long-term hurricane forecasts is imperative. 
Depending on their location, TCs can be called 
other names, such as hurricanes (in the Atlantic 
region), typhoons (in the West Pacific region), 
tropical storms, cyclonic storms, and tropical 
depressions. 

TC dynamics involve multiscale processes. To 
accurately simulate a TC’s evolution on a medium, 
or meso, scale, an ideal solution would be to im- 
prove the model to realistically capture both the 
downscaling processes associated with large-scale 
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flows, such as Madden-Julian oscillations or tropi- 
cal waves, and the upscaling processes associated 
with the feedback from small-scale flows, such 
as convection or precipitation. However, it has 
been challenging to accurately simulate TC ac- 
tivities with traditional numerical models because 
of artificial scale separations. For example, Earth 
atmospheric-modeling activities are convention- 
ally divided into three major categories: global, 
or large, scale; mesoscale, and cloud, or micro, 
scale. As Figure 1 shows, typical resolutions for 
the global, mesoscale (or regional), and cloud 
models are on the order of 100 km, 10 km, and 
1 km, respectively. 

Due to limited access to computing resourc- 
es, researchers have had to conduct TC climate 
studies primarily with coarse-resolution gen- 
eral circulation models (GCMs) 1 and partially 
with fine-resolution, regional mesoscale models 
(MMs). The former have the advantage of simu- 
lating the impact of global-scale flows on TC 
activities but might not accurately simulate meso- 
and cloud-scale flows. In contrast, the latter make 
it possible to simulate realistic TC intensity and 
structure with fine grid spacing, but this ap- 
proach has problems capturing the accurate im- 
pact of large spatial- and temporal-scale flows. 
Furthermore, the resolutions used in GCMs and 
MMs are still too coarse to resolve small-scale 
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Predictability of Tropical 
Cyclone Genesis 

B y examining the role of different tropical waves in trop- 
ical cyclogenesis with observations, William Frank and 
Paul Roundy 1 found a strong relationship between tropical 
cyclone (TC) formation and enhanced activity in equato- 
rial Rossby waves, mixed Rossby gravity waves, 2 tropical 
depression (TD)-type disturbances (or easterly waves), and 
Madden-julian oscillations (MJOs). 3 With a 45- to 60-day 
time scale, eastward-propagating MJOs, which are typically 
characterized by deep convection originating over the 
Indian Ocean, have one of the most prominent large-scale 
features of the tropical general circulation. These large- 
scale tropical systems could provide determinism with re- 
gards to TC genesis timing and location. Thus, it becomes 
possible to extend the lead time for TC genesis prediction 
and thus increase our confidence in the model's ability to 
simulate TC climates as long as the model can improve 
the simulations of the precursor and its modulation on TC 
activities. More information on modeling the association 
between TC genesis and these large-scale tropical systems 
appears elsewhere. 4-7 
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Figure 1. The three major categories of Earth atmospheric modeling (from left to right): global (large) 
scale, meso (medium) scale, and cloud (micro) scale. Typical resolutions for these models are roughly 
100 km, 10 km, and 1 km, respectively. 


convective motion in TC climate studies, forcing 
researchers to use cumulus parameterizations (CPs) 
to emulate the effects of unresolved subgrid- 
scale cloud motion. Because the development of 
CPs has been slow, their performance is a major 
limiting factor in TC simulations. To overcome 
the issues with CPs, a recent trend is to take full 
advantage of modern supercomputing power to 


explicitly resolve the impact of clouds in a global 
environment. Two approaches are to deploy high- 
resolution global models 2-7 or to couple a coarse- 
resolution global model with massive copies of 
cloud models. The latter approach is known as 
superparameterization, or the multiscale model- 
ing framework (MMF), and it’s the one we focus 
on here . 8-10 Recently, we successfully integrated 


September/October 2013 


57 





both models with visualizations into the coupled 
advanced multiscale modeling and visualization 
system (CAMVis) on NASA’s Pleiades supercom- 
puter, 4-7 which shows promise for short-term and 
extended-range TC simulations. 

In the MMF, you replace the conventional CP 
at each GCM grid point with a copy of a cloud- 
resolving model (CRM) to accurately represent 
non-hydrostatic, cloud-scale convection and its 
interaction with environmental flows. Thus, the 
MMF has the advantages of both the global-scale 
processes of a GCM and the sophisticated micro- 
physical processes of a CRM, and can be viewed 
as an alternative to a global CRM. However, this 
approach poses great challenges in terms of com- 
puting resources, data storage, and data analyses 
for multidecadal simulations of global TC activi- 
ties. These challenges include, but are not lim- 
ited to, running more than 10,000 instantiations 
of the CRM with great parallel performance, 
increasing the GCM’s resolution to capture re- 
alistic TC structure, efficiently archiving mas- 
sive volumes of data, and effectively conducting 
analyses to discover the predictive relationship 
between meteorological (short-term) and clima- 
tological (long-term) events. To address the first 
two issues, we propose a revised MMF coupling 
approach to improve the model’s scalability, en- 
abling higher resolutions in both the GCM and 
CRM, reducing the time to finish TC climate 
simulations. 

Throughout this article, we adopted the 
following conventions: the name of each routine 
ends with parentheses (); the name of each variable 
is in italics; the term process is a running instance 
of model codes; and the terms task and process are 
used as follows: the former emphasizes the work 
being done, and the latter emphasizes the worker 
doing it (for example, task A is performed using 
process A). 

NASA's MMF 

The NASA Goddard MMF 9,10 is based on the 
NAS AGoddard finite-volume GCM (fvGCM)and 
the Goddard cumulus ensemble (GCE) model. 
While the high-resolution fvGCM has shown 
remarkable capabilities in simulating large-scale 
flows, and thus hurricane tracks, 2-7,11 the GCE 
is well known for its superior performance in 
representing small, cloud-scale motions and has 
been used to produce more than 100 referred 
journal papers. 12,13 In the MMF, the fvGCM 
runs at a coarse (2° x 2.5°) resolution, and 
one copy of the GCE runs within each of the 
fvGCM grids. 10 


Researchers still widely use the 2° x 2.5° reso- 
lution in climate simulations with conventional 
climate models and recent MMFs because it’s com- 
putationally affordable. This resolution has grid 
points of (91, 144) in the (y, x) directions, giving 
a total of 91 x 144 (13,104) horizontal cells. Thus, 
13,104 GCEs are “embedded” in the fvGCM to 
allow explicit simulation of cloud processes in 
a global environment. Currently, only averaged 
thermodynamic fields such as temperature and wa- 
ter vapor in the GCEs are fed back to the fvGCM, 
in a process called thermodynamic feedback. The 
timestep for the individual 2D GCE is 10 seconds, 
and the fvGCM-GCE coupling interval is one 
hour at this resolution. Under this configuration, 
99 percent or more of the total wall time for run- 
ning the MMF is spent on the GCEs. Thus, wall 
time could be significantly reduced by efficiently 
distributing the large number of GCEs over a 
massive number of processors on a supercomputer. 

Let’s look more closely at the computational 
parts of the GCE and fvGCM, before discussing 
a revised strategy for coupling the latter with 
massive copies of the GCE to improve scalability. 

The GCE Model 

Typical MMF model runtime configurations 
are 64 grid points in the x direction, with a grid 
spacing of 4 km; 32 vertical stretched levels; cy- 
clic lateral boundary conditions; and a time step 
of 10 seconds. The GCE itself has been imple- 
mented with a 2D domain decomposition using 
message-passing interface version 1 (MPI-1) with 
good parallel efficiency. 14 Thus, an ideal solution 
for the course-grain parallelism implemented in 
the MMF is to run more copies of GCEs with 
higher CPU counts in parallel, while still keeping 
the option of taking advantage of the fine-grain 
parallelism inside the GCE. 

The fvGCM 

Resulting from a development effort of more 
than 15 years, the fvGCM is a unified numerical 
weather prediction (NWP) and climate model 
that can run on daily, monthly, decadal, or century 
time scales. 1115 The 1990s model was originally 
designed for climate studies at a coarse resolution 
of approximately 2x2.5 degrees, and its resolution 
was increased to 1 degree in 2000 and Vi degree in 
2002 for NWP. Since 2005, the high-resolution — 
that is, 1/8 and 1/12 degree — fvGCM has been 
deployed on NASA’s Columbia supercomputer, 
showing remarkable TC forecasts. 2 

The fvGCM’s parallelization was carefully de- 
signed to achieve efficiency, parallel scalability, 
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Figure 2. Parallelism in the finite-volume general circulation model (fvGCM) and the multiscale modeling framework (MMF) 
version 1.0. The fvGCM has the message-passing interface (MPI) parallel implementation with a ID domain decomposition 
along the y direction. The approach of embedding one copy of Goddard cumulus ensemble (GCE) into each of the fvGCM's 
grids inherits the parallelism and thus limited scalability of the fvGCM. The right panel shows the distribution of tasks over six 
processing elements (PEs). Task / (here, / = 1 ~ 5) performs 2,160 copies of GCEs, while task 6 has 2,304 copies of GCEs: Z(/) = 
2,160, /= 1-5; Z( 6) = 2,304; £^ =1 Z( /) = 13,104, with P = 6 in this case. 


flexibility, and portability. Its implementation had 
distributed- and shared-memory, two-level paral- 
lelism, including a coarse-grain parallelism with 
MPI and fine-grain parallelism with OpenMP 
(throughout this article, we refer to MPI as any 
one of MPI-1, MPI-2, MLP, or SHMEM com- 
munication paradigms). 16 However, the latter isn’t 
applicable for current MMF runs. Because MPI 
parallelism is applied to a ID decomposition over 
latitude in the fvGCM, and each MPI task needs 
at least three grid points in latitude for parallel ef- 
ficiency, MPI parallelism is very limited for small 
computational grids (see Figure 2). Assume NY is 
the number of grid points in the y direction. In 
general, NY is equal to (1807DF + 1), where DY is 
an increment in latitude that can be 0.08°, 0.125°, 
0.25°, 0.5°, 1°, or 2°. For the 2 x 2.5 degree grids 
where NY =91 (with DY = 2), the maximum of 
MPI tasks is only 30 (-91/3). In addition, because 
NY generally isn’t divisible by the number of se- 
lected processes (P), the domain decomposition 
is nonuniform, which leads to load unbalancing 


because some MPI processes receive more lati- 
tudes than others. 

Revised Parallelism in the MMF 

We discuss the revised parallelism by first intro- 
ducing the metaglobal GCE (mgGCE) and then 
presenting the details on parallel implementation, 
including the creation of two process groups — 
process mapping and load balancing — and the 
dynamical distribution of mgGCE inputs over 
mgGCE processes. 

The Metaglobal GCE 

In MMF version 1, each of 13,104 GCEs is em- 
bedded into an fvGCM grid cell. Although the 
implementation of this version is straightforward, 
by adopting the fvGCM’s parallel framework, it 
inherits both the fvGCM framework’s advantages 
and disadvantages. One of the disadvantages is 
that this approach limits the MMF’s parallel seal- 
ability — that is, it can only decompose the whole 
domains into small-number subdomains and each 
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Figure 3. A conceptual diagram of the revised parallelism 
implementation using six fvGCM tasks and 19 megaglobal GCE 
(mgGCE) tasks. In the original 1 D decomposition, each fvGCM process 
(top left panel) gets three or more latitudes (or rows) and computes 
the mgGCEs for each longitude (or column) sequentially within a row. 
The new version adds a second level of parallelism, decomposing a 
single row into individual columns (top right panel). Each fvGCM 
process has an associated set of mgGCE processes that compute the 
GCEs for a row in parallel. More mgGCE processes are assigned to the 
fvGCM process with more rows (that is, more copies of the GCEs). The 
1 3,1 04 GCE copies are distributed over these mgGCE processes (and 
the fvGCM processes as needed), achieving an effective 2D domain 
decomposition shown in the bottom panel. This revised parallelism 
can improve the MMF's scalability by allowing more copies of GCEs 
running in parallel, thus reducing wall time significantly. 


requires many copies of GCEs to run within one 
processing element (PE). For example, 2,160 cop- 
ies of the GCE run in series in one PE when we 
use only six PEs (see Figure 2). Let’s assume we 
use the maximum number PEs (30), and each 
PE performs one copy of the GCE at a time and 
needs to run 432 copies of the GCEs sequentially. 
Therefore, making more copies of the GCEs to 
run in parallel with more PEs is the key to reduc- 
ing the wall-clock time. 

From a computational perspective, the concept 
of embedding GCEs into the fvGCM restricts 
the MMF’s view of the parallelism — namely, it 
can only inherit it from the fvGCM. Because a 
GCE uses a periodic, lateral boundary condition, 
each embedded GCE’s execution is independent 
of the other GCEs during a time step of the 


fvGCM model, and thus the GCE can run in a 
separate MPI task. Accordingly, we propose a new 
coupling approach to improve the MMF’s paral- 
lel scalability. Conceptually, we refer to the 13,104 
(144 x 91) copies of GCEs as a supercomponent 
called a metaglobal GCE (mgGCE) in a meta 
grid-point system. To facilitate discussion, we as- 
sume this grid system is the same as the latitude- 
longitude grid structure in the fvGCM, although 
it isn’t necessarily tied to any specific grid system. 

With this concept in mind, each of the 
two individual components (the fvGCM and 
mgGCE) in the MMF could have its own 
domain decompositions. Note that because each 
GCE uses cyclic, lateral boundary conditions, 
there’s no data communication between any two 
GCEs. In other words, the mgGCE has no ghost 
region, which is defined as the edge points of the 
computing subdomain in one processor that store 
the data belonging to adjacent processors. Thus, 
a 2D domain decomposition in the mgGCE 
can significantly reduce the runtime for massive 
copies of the GCE, which can greatly improve 
MMF scalability because most of the wall time is 
spent on these GCEs. 

To couple the fvGCM and mgGCE, the paral- 
lelism is implemented to do the following: create 
two groups of processes with the MPI intercom- 
munication, one group with P fvGCM processes 
and the other with Q mgGCE processes; build 
a sophisticated static mapping between the P 
fvGCM and Q mgGCE processes; and dynami- 
cally distribute input values to mgGCE process- 
es from the fvGCM processes that can execute 
13,104 copies of the GCEs and handle load bal- 
ancing. This implementation leads to an effective 
2D domain decomposition in the mgGCM, while 
the original ID domain decomposition remains in 
the fvGCM. Simply speaking, the first-level par- 
allelism decomposes latitudes in the fvGCM, and 
the second-level parallelism decomposes longi- 
tudes in the mgGCE. Figure 3 offers a schematic 
diagram of these domain decompositions; we dis- 
cuss the technical details in the next section. 

Parallel Implementation 

We implemented fvGCM parallelism with MPI 
to allow point-to-point, collective communication 
among processes in the same group. This kind 
of “conventional” communication, which also 
appears in many existing MPI single-program 
multiple-data codes, is called intracommunication. 
In contrast, intercommunication occurs among 
processes in local and remote groups, where a lo- 
cal group is one within which a process initiates an 
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intercommunication operation — the local group 
is the sender (or receiver) in a send (or receive) 
call. A remote group is one that contains the tar- 
get process, which is the receiver (or sender) in a 
send (or receive) call. Note that these two groups 
don’t overlap: when the target process needs to be 
addressed, the local process uses an (intercom- 
municator, rank) pair with the rank relative to the 
remote group. 

Because of its unique features, we used inter- 
communication in the revised parallelism imple- 
mentation to couple the fvGCM and the mgGCE. 
The main program’s calling tree, fvGCM.F, con- 
tains the following steps to finish an MMF run: 



1. Amaster process calls the mmf_gce2d_task_ 
init ( ) to create two groups of processes 
and intercommunicators. One group has P 
fvGCM processes and the other has Q mg- 
GCE processes. Inside this subroutine, the 
mgGCE processes start waiting for inputs to 
perform GCE calculations and interact with 
the fvGCM processes to receive their inputs 
in step 5; all fvGCM processes return back to 
the main program to continue the execution. 

2. All fvGCM processes call mp_init() 
and y_decomp()to perform a ID domain 
decomposition. 

3. Each of the fvGCM processes calls the 
inmf_gce2d_task_assignment ( ) to asso- 
ciate itself with a subset of (T(J)) mgGCE 
processes. Here, J is the fvGCM process’s 
rank, and T(J) is proportional to the lati- 
tudes assigned to the fvGCM process, aimed 
at achieving load balancing. 

4. All fvGCM processes call mmf_init()to 
initialize the MMF run. 

5. All P fvGCM processes invoke the mmf_ 
run ( ) , where the mmf_invoke_gce2d ( ) is 
called, to interact with Q mgGCE processes 
that run the gce2d_task_main ( ) to finish 
144 x 91 gec2d runs. 

6. The fvGCM processes sequentially call 
mmf_f inalize ( ) and mmf_gce2d_task_ 
finalize ( ) to finalize the fvGCM tasks and 
mgGCE tasks, respectively. 

Let’s look more closely at steps 1, 3, and 5. 

Creation of two process groups and MPI intercom- 
municators. Figure 4 displays major functions in 
the subroutine mmf_gce2d_task_init ( ) that 
create both two groups of processes by calling 
MPI subroutine mpi_comm_split ( ) and the 
intercommunicators among these two groups 
with the mpi_interomm_creat ( ) , which include 


V V 


Return back to fvGCM() ^ 



where 


gce2d_task_main () 

mmfJnvoke_gce2d() 



is called 

\ 



Figure 4. Creation of two groups of processes and their intracommunicators 
and intercommunicators in the subroutine mmf_gce2d_task_init ( ) . 
The fvGCM group appears on the left-hand side, whereas the mgGCE 
group is on the right. gce2d_intercomm and fvgcm_inter_comm 
(in red) are intercommunicators, and gce2d_intra_comm and fvGCM_ 
intracomm (in blue) are intracommunicators. 


gce2d_intercomm with the fvGCM and fvGCM_ 
inter_comm with the mgGCE processes as local 
processes. While mgGCE processes enter a loop 
of gce2d_task_main ( ) and start waiting for in- 
puts to perform GCE calculations, all fvGCM 
processes return back to the main program where 
mmf_invoke_gce2d( ) is called to send data to 
the mgGCE processes; Lists 1 and 2 (Figures 5 
and 6) display these two subroutines’ major func- 
tionalities, which are discussed later. 

The mgGCE processes are homogenous and 
undifferentiated. In fact, they’re designed to be 
stateless, meaning they don’t need to know any- 
thing about the fvGCM side of things. They don’t 
know what time step it is, who might send them 
input, or anything about load balancing or task 
assignments. They simply receive a block of input 
from the fvGCM processes, use that input to call 
the gce2d ( ) routine, and then return a block of 
output back to whomever sent them the input. Ev- 
erything they need to know is either constant or 
provided within the block of input. 

Process mapping and load balancing. For a given 
total_mimber_of_processes ( P + Q ) and total _num- 
ber_of_mgGCE_processes ( Q ) at runtime, the map- 
ping between P fvGCM and Q mgGCE processes 
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1 . mainLoop : do 

2. call mpi_recv ( inputValues , , , MPI_ANY_SOURCE, MPI_ANY_TAG, fvGCM_inter_comm, 

status, ierror) 

3. tag = status (MPI_TAG) 

4. source = status (MPI_SOURCE) 

5. if (tag .gt. 0) then 

6. call mmf_call_gce2d( inputValues, outputValues) !! Compute gce2d() 

7. !! Pass the outputs back to the sender, using the supplied tag 

8. call mpi_send( outputValues, , MPI_BYTE, source, tag, fvGCM_inter_comm, ierror) 

9. else if (tag .eq. 0)then 

10 . ! ! Exit the do-loop and terminate 

11. exit mainLoop 

12. else 

13. !! Unexpected tag value 

14 . end if 

1 5 . enddo mainLoop 


Figure 5. List 1: pseudocode of the gce2d_task_main ( ) routine. Each of the mgCCE processes does a simple loop: receive 
inputValues from any fvCCM process, call gce2d ( ) model, and return outputValues to the same fvGCM process that 
sent the inputValues. 


happens in the routine mmf_gce2d_task_as- 
signment ( ) illustrated in Figure 7. The fvGCM 
processes agree among themselves to carve up the 
group of mgGCE processes into disjoint subsets, 
with each fvGCM process getting one of these 
subsets. Although the assignment is static, it isn’t 
necessarily uniform. 

For simplicity, we choose these disjoint sub- 
sets to be a contiguous block of processes, where 
“contiguous” is relative to their rank numbering 
within the intercommunicator (with the mgGCE 
as a remote group). The array firstExtraGce2d- 
Task, associated with the fvGCM processes, 
gives the comm rank of the first mgGCE process 
of the subset assigned to that fvGCM process; 
numGce2dTasksAssigned is the number of mg- 
GCE processes assigned to that fvGCM process, 
where firstExtraGce2dTask and numGce2d- 
TasksAssigned are referred to as S(J ) and T(J ), 
respectively, in Figure 7, with J representing the 
fvGCM’s rank. For example, an fvGCM pro- 
cess with the rank of 13 has numGce2dTasksAs- 
signed(13) or T(13) processes, starting with 
intercomm rank f irstExtraGce2dTask ( 13 ) or 
s ( 13 ) , and going up contiguously. 7(7) is rough- 
ly equal to (Q/P) but can be larger for some of 
fvGCM processes that need more latitudes than 
the other processes. The summation of all T(J) 
should be equal to Q, namely, T( 7) = Q, 

where 5(1) = 1 and 5(7+1) = 5(7) + 7(7), J = 1 ~ 
P -1. These intercomm task index values are what 
the fvGCM process uses for the “destination” 


field of the mpi_send ( ) call, which requires an 
(intercommunicator, rank) pair with the “rank” 
relative to the remote group. 

We can achieve load balancing by giving more 
mgGCE processes to the fvGCM processes that 
have more latitudes. As shown in the top panel of 
Figure 3, if one fvGCM process has to do three 
latitudes (or rows), and a different fvGCM process 
has to do four, we give the second fvGCM pro- 
cess more mgGCE processes (that is, the larger 
numGce2dTasksAssigned or 7) than we give to 
the first. Thus, the second fvGCM process might 
be able to do the work within a single latitude 
in only three-quarters as much time, so the two 
fvGCM processes will both finish their work in 
roughly equal amounts of time. While each of 
fvGCM processes can also help perform GCE 
calculations to improve performance, no fvGCM 
process shares its work with other mgGCE pro- 
cesses outside of its assigned subset. 

Simply speaking, although the load imbalance 
was introduced in association with the ID non- 
uniform decomposition in the y direction in the 
original MMF, it can be mitigated or resolved by 
another ID nonuniform decomposition in the x 
direction in the new MMF. 

Dynamical distribution of mgCCE inputs over mgGCE 
processes. Lists 1 and 2 (Figures 5 and 6) provide 
pseudocodes to show how mgGCE and fvGCM 
processes interact to perform the calculations of 
13,104 GCE copies. List 1 displays a main loop in 
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1. 

sendLoop: do d=l, depth 


2. 

do i=l, numGce2dTasks 


3. 

tag = 2 * nextColumn 


4. 

call mpi_isend (inputValues (nextColumn) , ..., tag, gce2d_intercomm, 
(nextColumn) , ierror) 

request 

5. 

nextColumn = nextColumn +1 


6. 

enddo 


7 . 

enddo sendLoop 


8. 

sendRecvLoop : do while (nextColumn .le. NumLongitudes ) 


9 . 

call mpi_iprobe (MPI_ANY_SOURCE, MPI_ANY_TAG, gcen2d_intercomm, pending, status. 


ierror) 


10. 

If (pending) then 


11. 

whichTask = status (MPI_SOURCE) 


to 

whichColumn = status (MPI_TAG) / 2 


13 . 

call mpi_recv (outputValues (whichColumn) ,.... , whichTask, ..., gce2d_ 

_intercomm ... .) 

14. 

call mpi_request_f ree (request (whichColumn) , ierror) 


15. 

tag = 2* nextColumn 


16. 

call mpi_isend ( inputValues (nextColumn) , ... tag, gce2d_intercomm, 
request (nextColumn) , ierror) 


17 . 

else 


18. 

call mmf call gce2d (tmplnput ) ! Running with fvGCM processes 


19 . 

endif 


to 

o 

numCompleted = numCompleted + 1 


21. 

nextColumn = nextColumn + 1 


22 . 

enddo sendRecvLoop 


23. 

recvLoop: do while (numCompleted . It. NumLongitudes) 


to 

call mpi_probe (MPI_ANY_SOURCE, MPI_ANY_TAG, gce2d_intercomm, pending, status. 


ierror) 


25. 

whichColumn = status (MPI_TAG) / 2 


26. 

call mpi recv (outputValues (whichColumn) , ... gce2d intercom ) 


27 . 

call mpi_request_free (request (whichColumn) , ierror) 


28. 

numCompleted = numCompleted + 1 


29 . 

enddo recvLoop 



Figure 6. List 2: pseudocode of the ramf_invoke_gce2d ( ) routine, which addresses distribution of global initial values to 
Q mgGCE processes from P fvGCM processes. The fvGCM processes send inputs to mgGCE processes (which run in parallel) 
and then receive outputs from them. The fvGCM processes continue to do so until the GCE calculation over the entire globe 
(that is, the calculation of the 13,104 GCE copies) is completed. Q is equal or larger than the multiple of P, so Q > = C * P, 
where Cis an integer. 


gce2d_task_main ( ) , where all mgGCE process- 
es wait to receive inputValues from any fvGCM 
process (line 2), call the gce2d( ) model (line 
6), and return outputValues (line 8 ) to the same 
fvGCM process that sent the inputValues. In 
short, mgGCE process does a simple loop: recv 
input - compute gee - send output. As 
mentioned, these mgGCE processes are essen- 
tially stateless, but in the current implementation, 
each mgGCE process is bound (assigned statically) 
to a single, particular fvGCM process, which is 
just a convenience for the fvGCM side. 

All the controlling business happens on the 
fvGCM side, as in List 2. Each fvGCM pro- 
cess uses the fixed and static process mapping to 


interact with the corresponding subsets of mg- 
GCE processes. In addition to handling inputs 
and outputs, the fvGCM processes can also help 
perform GCE calculations to take advantage of 
their computational potential. Therefore, the 
inputs and computations for the entire globe are 
dynamically distributed to the mgGCE processes 
via the following three phases, which correspond 
to the three loops in the routine in List 2: 

• In the send loop (lines 1-7), the fvGCM pro- 
cess sends out a lot of inputs to all of its mg- 
GCE processes, queuing up depth inputs to 
each mgGCE process. Here the variable depth 
indicates the number of inputs that are already 
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PfvGCM Processes 


Q mgGCE Processes 



Figure 7. Mapping between PfvGCM process (left) and Q mgGCE processes (right). The fvGCM process 
with the rank of j receives a subset of Q mgGCE processes, say, T(/). The choice of T(/) depends on the 
amount of work (that is, the number of latitudes) assigned to the fvGCM task, and 5^y = i7T/) = Q- Let 5(/) 
be the rank of the first mgGCE process (the starting index) in the )th subset of mgGCE processes, where 
5(1)= 1 and 5(/+ 1) = 5(/) + T(/), / = 1, 2,... P- 1. 



Figure 8. Dynamic distribution of mgGCE input values. Here, we 
assume that one specific fvGCM process (at the top, in yellow) has 
three mgGCE processes (at the bottom, in brown). The fvGCM 
process sends inputs to each of the three mgGCM processes, which 
begin processing them (brown). The fvGCM process also queues 
additional inputs (blue) for the mgGCE processes. While waiting for 
results from the mgGCE process, the fvGCM process can also help 
perform a GCE calculation (middle, in yellow). When computations 
with inputs in brown are done, the mgGCE process returns results and 
begins processing the next input in its queue. On receiving a result, 
the fvGCM process queues another input (gray) to whichever mgGCE 
process returned the result. 

sent asynchronously to each mgGCE process’s 
input queue. 

• In the sendRecv loop (lines 8-22), the fvGCM 
process checks if any of its mgGCE processes 
have produced an output, and if so, sends anoth- 
er input piece of work to the mgGCE’s queue. 
However, if none of the mgGCE processes 


has produced an output, rather than waiting for 
the outputs from the mgGCE processes, one 
of the fvGCM process pitches in and does 
the next piece of work: running a copy of 
gce2d(). Then it goes back to checking for 
output. The fvGCM processes repeat the 
above until all the input has been sent, using 
nextColumn to keep track. 

• In the recvLoop (lines 23-29), the fvGCM 
processes keep receiving outputs until all the 
output has been received, using numCompleted 
to keep track. 

During these phases, the matchup between a par- 
ticular mpi_[i] send ( ) and a particular mpi_ [ i ] 
recv ( ) among fvGCM and mgGCE processes 
happens dynamically and asynchronously. Next, 
a simple illustration is given. 

Let’s assume that an fvGCM process has 
(numGce2dTasks, NumLongitudes) to be (3, 
144) in List 2. Thus, it interacts with three mg- 
GCE processes (we call them AA, BB, and CC in 
Figure 8) to dynamically distribute 144 inputs and 
finish 144 copies of GCE runs. With depth=3 
in the sendLoop (line 1 in List 2), the fvGCM 
process initiates nine (depth*numGce2dTasks) 
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Figure 9. Parallel scalability of the MMF version 2.0 with a revised parallel implementation on the NASA 
Pleiades supercomputer. This figure shows that a speedup of nearly 80x is obtained as the number of cores 
increases from 30 to 3,335. Note that the original MMF could use only 30 cores. 


mpi_isend( ) calls (line 4) to send inputValue 
records as follows: the first to AA, the second to 
BB, the third to CC, the fourth to AA, the fifth 
to BB, the sixth to CC, the seventh to AA, the 
eighth to BB, and the ninth to CC. This ends 
the sendLoop, and each of the three mgGCE 
processes now has three (depth=3) input records 
queued up. 

Each of these mgGCE processes starts run- 
ning the subroutine gce2d() as soon as it re- 
ceives input, so the fvGCM process now enters 
the sendRecv loop (between lines 8-22) and calls 
mpi_iprobe ( ) (line 9) to check for outputs sent 
from any of the mgGCE processes. Let’s sup- 
pose that none has done so. Rather than wait, 
the fvGCM process takes the next piece of input 
(the 10th) and calls gce2d() (line 18). When 
the fvGCM process finishes and returns, it again 
checks for outputs from any mgGCE processes 
(line 9). Suppose BB has finished one gce2d() 
calculation, and so BB has returned the output 
for the second piece. BB then immediately begins 
work on its next piece, namely, the fifth piece. As 
soon as the fvGCM process sees the existence of 
a pending message (the outputs from the mg- 
GCE), it calls mpi_recv ( ) (line 13) to receive 
the outputs. The fvGCM process can tell it’s the 
second piece by looking at the tag value and that 
it came from process BB. So the fvGCM process 


sends the next piece of input (the 11th) to BB 
(line 16) and then returns back to the beginning 
of the loop in line 9 to check for outputs from any 
mgGCE processes. This loop continues until all 
144 pieces of input have been handed out. The 
distribution of inputs and GCE runs is handed 
out dynamically as each previous piece of work 
is finished. The fvGCM process now enters the 
recvLoop (line 23). There’s no more input to be 
handed out, so the fvGCM process waits until all 
the remaining pieces of work have been finished 
and returned. 

Computational Results 

To test our boosted scalability in action, we 
ran the improved MMF on the NASA Pleiades 
supercomputer, which is currently one of the 
most powerful general-purpose supercomputers 
in the world. 

The Pleiades supercomputer is an SGI Altix ICE 
system with a peak performance of 2.88 Pflop/s. 
With 417 Tbytes of total memory, and 162,496 
cores (plus 64 GPU nodes, each with 512 CUDA 
cores.), it achieves Linpack performance of 1.24 
Pflop/s (as of June 2013). The system contains 
four different types of Intel Xeon processors — 
E5-2680v2 (Ivy Bridge), E5-2670 (Sandy Bridge), 
X5670 (Westmere), and X5570 (Nehalem) — to 
reach different needs and capacities on different 
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NASA projects. We used Nehalem processors in 
our benchmark. 

Figure 9 shows a benchmark with encouraging 
parallel scalability up to 3,335 CPUs on Pleia- 
des. Here, the speedup is determined by T^q/T, 
where T is the wall time to perform a five-day 
forecast with the MMF, and T 30 is the time 
spent using 30 CPUs. We chose the run with 30 
CPUs as a baseline because this configuration 
was previously used for production runs . 9,10 We 
obtained a speedup of 3.42, 10.81, 19.46, 38.46, 
50.16, and 79.77 by increasing the number of 
CPUs from 30 to 91, 273, 546, 1,115, 1,680, 
and 3,335 cores, respectively. As the baseline 
has load imbalances and excessive memory us- 
age in the master process, it isn’t surprising to 
obtain a superlinear speedup with lower CPU 
counts up to 1,115. 

Further analysis of the MMF’s throughput in- 
dicates that it takes roughly 41 minutes to finish 
a five-day forecast using 3,335 cores, which meets 
the requirement for performing real-time nu- 
merical weather prediction — that is, completion 
of a run within one hour. A three-year simulation 
would only take one day to run with 3,335 cores, 
as opposed to about 80 days with 30 cores. This 
speedup in wall time makes it far more feasible 
for increasing the resolution in the fvGCM and 
studying TCs’ interannual variability. 

I deally, the parallelism that leads to an ef- 
fective 2D domain decomposition in the 
mgGCE can be applied to other types of 
column-based physics parameterizations. 
The approach we described here also lays the 
groundwork for more sophisticated modeling to 
solve extraordinarily complex problems with ad- 
vanced computing power. For example, improved 
scalability makes it possible to deploy a more ad- 
vanced MMF with the fvGCM at a higher resolu- 
tion, say, 0.25° x 0.36°, which has much larger grid 
points (721 x 1,000) and thus requires many more 
copies of GCEs. In addition, by taking load bal- 
ance into consideration, the current implementa- 
tion makes it feasible to choose a variety of GCEs 
in the mgGCE. Further improvements include 
the implementation of an efficient I/O module 
or a parallel I/O module with a CPU layout that 
could differ from that in either the fvGCM or the 
mgGCE. 

The current parallel implementation in the 
MMF is a coarse-grained parallelism compared 
to the parallelism inside a GCE. Because an in- 
dividual GCE was previously implemented with 


its native 2D domain decomposition, another 
level of parallelism (fine-grain parallelism) inside 
each copy of the GCE could greatly expand the 
number of CPUs for MMF runs. Potentially, the 
coupled MMF, along with the mgGCE, could be 
scaled at a multiple of 13,104 CPUs, which is a 
subject for future study. se 
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